{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "53cbf8c7-5e33-4a0e-b10e-33da5127c6af",
   "metadata": {
    "ExecutionIndicator": {
     "show": true
    },
    "execution": {
     "iopub.execute_input": "2024-10-01T07:46:03.804799Z",
     "iopub.status.busy": "2024-10-01T07:46:03.803991Z",
     "iopub.status.idle": "2024-10-01T07:46:03.828404Z",
     "shell.execute_reply": "2024-10-01T07:46:03.827755Z",
     "shell.execute_reply.started": "2024-10-01T07:46:03.804776Z"
    },
    "tags": []
   },
   "source": [
    "# Aria Inference Recipes\n",
    "\n",
    "This is a vLLM version of the inference recipe, intended to give users faster inference.\n",
    "\n",
    "## Section 3: Multi-page PDF Understanding (vLLM)\n",
    "\n",
    "This section shows our recommended recipe for understanding a multi-page PDF (e.g. arXiv papers, financial reports, slides, scanned books) with the Aria model. We use the [LongVideoBench](https://arxiv.org/pdf/2407.15754) paper (July 2024, ^1) as an example and walk end to end from a `.pdf` file to various types of responses.\n",
    "\n",
    "By default, we use the split-image setting, since PDF pages are information-rich.\n",
    "\n",
    "^1: The paper was released after the model's knowledge cutoff, so the model has not seen it during training.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "81989981-a740-4597-99e4-a02f1f99b9da",
   "metadata": {
    "tags": []
   },
   "source": [
    "\n",
    "### [General] Load Model and Processor\n",
    "\n",
    "To maximize the input length that Aria (vLLM version) can handle on a single 80GB GPU, we recommend the following parameters:\n",
    "\n",
    "- `max_model_len`: 38400\n",
    "- `gpu_memory_utilization`: 0.84\n",
    "\n",
    "These settings allow a very long input of up to 64 high-resolution (980) or 256 mid-resolution (490) images on a single GPU, which covers our long-context evaluation cases in Sections 3 and 4. Enjoy!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "c6b3a68b-a029-4df4-86fc-4d74997d5109",
   "metadata": {
    "ExecutionIndicator": {
     "show": false
    },
    "execution": {
     "iopub.execute_input": "2024-10-04T12:54:37.657573Z",
     "iopub.status.busy": "2024-10-04T12:54:37.657277Z",
     "iopub.status.idle": "2024-10-04T12:55:22.689457Z",
     "shell.execute_reply": "2024-10-04T12:55:22.688698Z",
     "shell.execute_reply.started": "2024-10-04T12:54:37.657550Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/root/miniconda3/envs/aria/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
      "  from .autonotebook import tqdm as notebook_tqdm\n",
      "2024-10-04 20:54:40,866\tINFO util.py:154 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO 10-04 20:54:41 config.py:1652] Downcasting torch.float32 to torch.bfloat16.\n",
      "WARNING 10-04 20:54:41 arg_utils.py:940] The model has a long context length (38400). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.\n",
      "WARNING 10-04 20:54:41 config.py:389] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used\n",
      "INFO 10-04 20:54:41 llm_engine.py:226] Initializing an LLM engine (v0.6.1.dev238+ge2c6e0a82) with config: model='/cpfs/29cd2992fe666f2a/user/zhoufan/yivl_open_source/models/uf_sft_0929_seqlen8k_from_sft0916_afterlong_iden_1600', speculative_config=None, tokenizer='/cpfs/29cd2992fe666f2a/user/zhoufan/yivl_open_source/models/uf_sft_0929_seqlen8k_from_sft0916_afterlong_iden_1600', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=38400, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/cpfs/29cd2992fe666f2a/user/zhoufan/yivl_open_source/models/uf_sft_0929_seqlen8k_from_sft0916_afterlong_iden_1600, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=False, mm_processor_kwargs=None)\n",
      "INFO 10-04 20:54:44 model_runner.py:1014] Starting to load model /cpfs/29cd2992fe666f2a/user/zhoufan/yivl_open_source/models/uf_sft_0929_seqlen8k_from_sft0916_afterlong_iden_1600...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Loading safetensors checkpoint shards:   0% Completed | 0/12 [00:00<?, ?it/s]\n",
      "Loading safetensors checkpoint shards:   8% Completed | 1/12 [00:02<00:27,  2.53s/it]\n",
      "Loading safetensors checkpoint shards:  17% Completed | 2/12 [00:05<00:27,  2.74s/it]\n",
      "Loading safetensors checkpoint shards:  25% Completed | 3/12 [00:08<00:24,  2.71s/it]\n",
      "Loading safetensors checkpoint shards:  33% Completed | 4/12 [00:10<00:21,  2.74s/it]\n",
      "Loading safetensors checkpoint shards:  42% Completed | 5/12 [00:13<00:18,  2.70s/it]\n",
      "Loading safetensors checkpoint shards:  50% Completed | 6/12 [00:15<00:13,  2.32s/it]\n",
      "Loading safetensors checkpoint shards:  58% Completed | 7/12 [00:18<00:12,  2.52s/it]\n",
      "Loading safetensors checkpoint shards:  67% Completed | 8/12 [00:20<00:10,  2.56s/it]\n",
      "Loading safetensors checkpoint shards:  75% Completed | 9/12 [00:23<00:07,  2.56s/it]\n",
      "Loading safetensors checkpoint shards:  83% Completed | 10/12 [00:26<00:05,  2.66s/it]\n",
      "Loading safetensors checkpoint shards:  92% Completed | 11/12 [00:28<00:02,  2.67s/it]\n",
      "Loading safetensors checkpoint shards: 100% Completed | 12/12 [00:31<00:00,  2.72s/it]\n",
      "Loading safetensors checkpoint shards: 100% Completed | 12/12 [00:31<00:00,  2.64s/it]\n",
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO 10-04 20:55:17 model_runner.py:1025] Loading model weights took 47.1793 GB\n",
      "WARNING 10-04 20:55:17 model_runner.py:1196] Computed max_num_seqs (min(256, 38400 // 65536)) to be less than 1. Setting it to the minimum value of 1.\n",
      "INFO 10-04 20:55:19 gpu_executor.py:122] # GPU blocks: 2650, # CPU blocks: 936\n"
     ]
    }
   ],
   "source": [
    "# load Aria model & tokenizer with vllm\n",
    "\n",
    "import os\n",
    "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0\"\n",
    "\n",
    "import requests\n",
    "import torch\n",
    "from PIL import Image\n",
    "\n",
    "from transformers import AutoTokenizer\n",
    "from vllm import LLM, SamplingParams\n",
    "\n",
    "\n",
    "model_id_or_path = \"rhymes-ai/Aria\"\n",
    "\n",
    "model = LLM(\n",
    "        model=model_id_or_path,\n",
    "        tokenizer=model_id_or_path,\n",
    "        dtype=\"bfloat16\",\n",
    "        limit_mm_per_prompt={\"image\": 256},\n",
    "        enforce_eager=True,\n",
    "        trust_remote_code=True,\n",
    "        max_model_len=38400,\n",
    "        gpu_memory_utilization=0.84,\n",
    "    )\n",
    "\n",
    "tokenizer = AutoTokenizer.from_pretrained(\n",
    "        model_id_or_path, trust_remote_code=True, use_fast=False\n",
    "    )"
   ]
  },
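  {
   "cell_type": "markdown",
   "id": "b1f0c2a4-3d5e-4f6a-8b7c-9d0e1f2a3b41",
   "metadata": {},
   "source": [
    "As a rough sanity check on the limits above, the sketch below estimates how many visual tokens a batch of images consumes and how much of `max_model_len` is left for text. The per-image costs (256 tokens at 980 resolution, 128 tokens at 490 resolution) and the helper name `visual_token_budget` are assumptions for illustration, not values read from the model config. Split-image inputs produce several crops per page, so they cost more than this estimate."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b1f0c2a4-3d5e-4f6a-8b7c-9d0e1f2a3b42",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch only: per-image token costs are assumptions, not read from the model.\n",
    "ASSUMED_TOKENS_PER_IMAGE = {980: 256, 490: 128}\n",
    "\n",
    "def visual_token_budget(num_images, resolution, max_model_len=38400):\n",
    "    \"\"\"Return (visual_tokens, tokens_left_for_text) under the assumed costs.\"\"\"\n",
    "    visual_tokens = num_images * ASSUMED_TOKENS_PER_IMAGE[resolution]\n",
    "    return visual_tokens, max_model_len - visual_tokens\n",
    "\n",
    "print(visual_token_budget(64, 980))   # 64 high-resolution images\n",
    "print(visual_token_budget(256, 490))  # 256 mid-resolution images"
   ]
  },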
  {
   "cell_type": "markdown",
   "id": "a90e837d-1f82-419f-b562-8c188150e107",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Installing PyMuPDF & Defining PDF2Image Function\n",
    "\n",
    "To convert a PDF into images, we use the PyMuPDF package, installed as follows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "74b4ffc7-8ceb-460b-b3cc-206180c3ef7e",
   "metadata": {
    "ExecutionIndicator": {
     "show": true
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "%%sh\n",
    "pip install PyMuPDF"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "553aac05-4a13-4fff-a57c-ef78a32867ca",
   "metadata": {
    "ExecutionIndicator": {
     "show": true
    },
    "execution": {
     "iopub.execute_input": "2024-10-04T12:55:22.690661Z",
     "iopub.status.busy": "2024-10-04T12:55:22.690455Z",
     "iopub.status.idle": "2024-10-04T12:55:22.758640Z",
     "shell.execute_reply": "2024-10-04T12:55:22.757764Z",
     "shell.execute_reply.started": "2024-10-04T12:55:22.690644Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "import fitz  # PyMuPDF\n",
    "from PIL import Image, ImageFile\n",
    "\n",
    "def pdf_to_images(pdf_path):\n",
    "    # Open the PDF file using PyMuPDF\n",
    "    doc = fitz.open(pdf_path)\n",
    "    \n",
    "    # Store each page as a PIL image\n",
    "    images = []\n",
    "    \n",
    "    for page_num in range(doc.page_count):\n",
    "        page = doc.load_page(page_num)\n",
    "        \n",
    "        # Convert page to a pixmap (image representation in PyMuPDF)\n",
    "        pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))\n",
    "        \n",
    "        # Create a PIL image from the pixmap's byte data\n",
    "        img = Image.frombytes(\"RGB\", [pix.width, pix.height], pix.samples)\n",
    "        images.append(img)\n",
    "    \n",
    "    doc.close()\n",
    "    \n",
    "    return images\n",
    "\n",
    "\n"
   ]
  },
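  {
   "cell_type": "markdown",
   "id": "c2a1d3b5-4e6f-4a7b-9c8d-0e1f2a3b4c51",
   "metadata": {},
   "source": [
    "`fitz.Matrix(2, 2)` scales the page by 2 in each direction. PDF page sizes are measured in points (72 per inch), so a zoom of 2 corresponds to roughly 144 DPI. The sketch below only does this arithmetic; `rendered_size` is a hypothetical helper, and the US Letter page size is just an example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c2a1d3b5-4e6f-4a7b-9c8d-0e1f2a3b4c52",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: approximate pixel size of a rendered page for a given zoom factor.\n",
    "POINTS_PER_INCH = 72  # PDF page sizes are expressed in points\n",
    "\n",
    "def rendered_size(width_pt, height_pt, zoom=2.0):\n",
    "    \"\"\"Return (width_px, height_px, dpi) for a page rendered at `zoom`.\"\"\"\n",
    "    return round(width_pt * zoom), round(height_pt * zoom), POINTS_PER_INCH * zoom\n",
    "\n",
    "# US Letter is 612 x 792 points\n",
    "print(rendered_size(612, 792, zoom=2.0))"
   ]
  },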
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "22c536b1-7886-4ea3-a6fd-94277400f1f4",
   "metadata": {
    "ExecutionIndicator": {
     "show": true
    },
    "execution": {
     "iopub.execute_input": "2024-10-04T12:55:22.760516Z",
     "iopub.status.busy": "2024-10-04T12:55:22.759837Z",
     "iopub.status.idle": "2024-10-04T12:55:24.326106Z",
     "shell.execute_reply": "2024-10-04T12:55:24.325353Z",
     "shell.execute_reply.started": "2024-10-04T12:55:22.760462Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "images = pdf_to_images(\"visuals/longvideobench.pdf\")[:9] #limit to 9 pages, removing appendix and references"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "f36df969-66ec-4681-be73-8141c1cde468",
   "metadata": {
    "ExecutionIndicator": {
     "show": true
    },
    "execution": {
     "iopub.execute_input": "2024-10-04T12:55:24.327574Z",
     "iopub.status.busy": "2024-10-04T12:55:24.327375Z",
     "iopub.status.idle": "2024-10-04T12:55:24.331692Z",
     "shell.execute_reply": "2024-10-04T12:55:24.330909Z",
     "shell.execute_reply.started": "2024-10-04T12:55:24.327557Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "from typing import List\n",
    "\n",
    "def get_placeholders_for_multiple_pages(images: List):\n",
    "    contents = []\n",
    "    for i, _ in enumerate(images):\n",
    "        contents.extend(\n",
    "            [\n",
    "                {\"text\": f\"Page {i+1}: \", \"type\": \"text\"},\n",
    "                {\"text\": None, \"type\": \"image\"},\n",
    "                {\"text\": \"\\n\", \"type\": \"text\"}\n",
    "            ]\n",
    "        )\n",
    "    return contents\n",
    "\n",
    "contents = get_placeholders_for_multiple_pages(images)"
   ]
  },
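  {
   "cell_type": "markdown",
   "id": "d3b2e4c6-5f7a-4b8c-8d9e-1f2a3b4c5d61",
   "metadata": {},
   "source": [
    "The image placeholders in the message should line up one-to-one with the `images` list passed at generation time. A minimal check, using a hypothetical helper `count_image_placeholders` and a stand-in list of three pages, might look like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d3b2e4c6-5f7a-4b8c-8d9e-1f2a3b4c5d62",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: verify that placeholders and images line up before calling the model.\n",
    "def count_image_placeholders(message_contents):\n",
    "    \"\"\"Count the entries of type 'image' in a chat message's content list.\"\"\"\n",
    "    return sum(1 for item in message_contents if item.get(\"type\") == \"image\")\n",
    "\n",
    "# Stand-in for three pages, same layout as get_placeholders_for_multiple_pages\n",
    "demo_contents = []\n",
    "for page_index in range(3):\n",
    "    demo_contents += [\n",
    "        {\"text\": f\"Page {page_index + 1}: \", \"type\": \"text\"},\n",
    "        {\"text\": None, \"type\": \"image\"},\n",
    "        {\"text\": \"\\n\", \"type\": \"text\"},\n",
    "    ]\n",
    "\n",
    "assert count_image_placeholders(demo_contents) == 3"
   ]
  },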
  {
   "cell_type": "markdown",
   "id": "ecbeaf58-cb80-4585-88f5-3e128fd117db",
   "metadata": {
    "tags": []
   },
   "source": [
    "#### Pages Visualization\n",
    "\n",
    "Let's visualize the paper as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "b70c6225-c846-4fa6-a130-004905236367",
   "metadata": {
    "ExecutionIndicator": {
     "show": true
    },
    "execution": {
     "iopub.execute_input": "2024-10-04T12:55:24.332426Z",
     "iopub.status.busy": "2024-10-04T12:55:24.332254Z",
     "iopub.status.idle": "2024-10-04T12:55:24.336623Z",
     "shell.execute_reply": "2024-10-04T12:55:24.336075Z",
     "shell.execute_reply.started": "2024-10-04T12:55:24.332411Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "from PIL import Image\n",
    "\n",
    "def create_image_gallery(images, columns=3, spacing=20, bg_color=(200, 200, 200)):\n",
    "    \"\"\"\n",
    "    Combine multiple images into a single larger image in a grid format.\n",
    "    \n",
    "    Parameters:\n",
    "        images (list of PIL.Image.Image): Page images to arrange; all are assumed to be the same size.\n",
    "        columns (int): Number of columns in the gallery.\n",
    "        spacing (int): Space (in pixels) between the images in the gallery.\n",
    "        bg_color (tuple): Background color of the gallery (R, G, B).\n",
    "    \n",
    "    Returns:\n",
    "        PIL.Image: A single combined image.\n",
    "    \"\"\"\n",
    "    # Use the first image's size as the grid cell size\n",
    "    img_width, img_height = images[0].size  # Assuming all images are of the same size\n",
    "\n",
    "    # Calculate rows needed for the gallery\n",
    "    rows = (len(images) + columns - 1) // columns\n",
    "\n",
    "    # Calculate the size of the final gallery image\n",
    "    gallery_width = columns * img_width + (columns - 1) * spacing\n",
    "    gallery_height = rows * img_height + (rows - 1) * spacing\n",
    "\n",
    "    # Create a new image with the calculated size and background color\n",
    "    gallery_image = Image.new('RGB', (gallery_width, gallery_height), bg_color)\n",
    "\n",
    "    # Paste each image into the gallery\n",
    "    for index, img in enumerate(images):\n",
    "        row = index // columns\n",
    "        col = index % columns\n",
    "\n",
    "        x = col * (img_width + spacing)\n",
    "        y = row * (img_height + spacing)\n",
    "\n",
    "        gallery_image.paste(img, (x, y))\n",
    "\n",
    "    return gallery_image"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "01f48017-0bed-43b9-9baa-ddb02c3783f1",
   "metadata": {
    "ExecutionIndicator": {
     "show": true
    },
    "execution": {
     "iopub.execute_input": "2024-10-02T05:55:43.142210Z",
     "iopub.status.busy": "2024-10-02T05:55:43.141088Z",
     "iopub.status.idle": "2024-10-02T05:55:43.238415Z",
     "shell.execute_reply": "2024-10-02T05:55:43.237739Z",
     "shell.execute_reply.started": "2024-10-02T05:55:43.142155Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "create_image_gallery(images).save(\"longvideobench_gallery.jpg\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c3f45e05-3ad9-420a-a459-fd46c35fa004",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Task 1: Find and Narrate Figures in the Paper\n",
    "\n",
    "The first task is to find and describe every figure in the paper. This is an ability of an LMM that an OCR + LLM pipeline cannot replace."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "0b9c9faf-5753-49b6-a32a-c32289e60d19",
   "metadata": {
    "ExecutionIndicator": {
     "show": true
    },
    "execution": {
     "iopub.execute_input": "2024-10-04T12:55:24.337560Z",
     "iopub.status.busy": "2024-10-04T12:55:24.337315Z",
     "iopub.status.idle": "2024-10-04T12:55:43.540270Z",
     "shell.execute_reply": "2024-10-04T12:55:43.539643Z",
     "shell.execute_reply.started": "2024-10-04T12:55:24.337538Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/tmp/ipykernel_1195546/1755346060.py:13: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.\n",
      "  with torch.inference_mode(), torch.cuda.amp.autocast(dtype=torch.bfloat16):\n",
      "Processed prompts: 100%|██████████| 1/1 [00:16<00:00, 16.27s/it, est. speed input: 715.35 toks/s, output: 17.64 toks/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1. **Figure 1**: This figure illustrates the concept of LONGVIDEOBENCH, a benchmark for long-context video-language understanding. It features a question about changes in a woman's backpack over time, with multiple frames showing different scenarios. The figure highlights the challenge of understanding long-term context in video data.\n",
      "\n",
      "2. **Figure 2**: This figure provides examples of the 17 categories of referring reasoning questions included in LONGVIDEOBENCH. Each example shows a question related to specific video contexts, demonstrating the diversity and complexity of the questions designed to test the models' understanding of long-term video content.\n",
      "\n",
      "3. **Figure 3**: This figure outlines the video and subtitle collection process for LONGVIDEOBENCH. It includes a flowchart showing the steps from downloading videos to annotating them, ensuring high-quality data for evaluating video-language models.\n",
      "\n",
      "4. **Figure 4**: This figure presents the accuracy of various models (both proprietary and open-source) across different video durations concerning the position of the queried moment within the video. The heatmaps compare the performance of models like GPT-4, Gemini, and others, showing how accuracy varies depending on whether the queried moment is closer to the video's start, middle, or end.<|im_end|>\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "\n",
    "messages = [\n",
    "    {\n",
    "        \"role\": \"user\",\n",
    "        \"content\": [\n",
    "            *contents,\n",
    "            {\"text\": \"Please narrate what each Figure (in total 4 Figures) is about in this paper.\", \"type\": \"text\"},\n",
    "        ],\n",
    "    }\n",
    "]\n",
    "\n",
    "text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)\n",
    "\n",
    "with torch.inference_mode(), torch.cuda.amp.autocast(dtype=torch.bfloat16):\n",
    "    outputs = model.generate(\n",
    "            {\n",
    "                \"prompt_token_ids\": text,\n",
    "                \"multi_modal_data\": {\n",
    "                    \"image\": images,\n",
    "                    \"max_image_size\": 980,  # [Optional] The max image patch size, default `980`\n",
    "                    \"split_image\": True,  # [Optional] whether to split the images, default `False`\n",
    "                },\n",
    "            },\n",
    "            sampling_params=SamplingParams(max_tokens=4096, top_k=1, stop=[\"<|im_end|>\"])\n",
    "        )\n",
    "    generated_tokens = outputs[0].outputs[0].token_ids\n",
    "    result = tokenizer.decode(generated_tokens)\n",
    "\n",
    "print(result)"
   ]
  },
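  {
   "cell_type": "markdown",
   "id": "e4c3f5d7-6a8b-4c9d-9eaf-2a3b4c5d6e71",
   "metadata": {},
   "source": [
    "The decoded output above ends with the stop token `<|im_end|>`. If you want clean text, one option is to strip it after decoding. The helper `strip_stop_token` below is a hypothetical convenience, not part of Aria or vLLM."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e4c3f5d7-6a8b-4c9d-9eaf-2a3b4c5d6e72",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: remove a trailing stop token from decoded text.\n",
    "def strip_stop_token(decoded_text, stop_token=\"<|im_end|>\"):\n",
    "    \"\"\"Drop one trailing `stop_token` (and surrounding whitespace), if present.\"\"\"\n",
    "    cleaned = decoded_text.rstrip()\n",
    "    if cleaned.endswith(stop_token):\n",
    "        cleaned = cleaned[: -len(stop_token)].rstrip()\n",
    "    return cleaned\n",
    "\n",
    "print(strip_stop_token(\"Figure 4 shows accuracy by video duration.<|im_end|>\\n\"))"
   ]
  },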
  {
   "cell_type": "markdown",
   "id": "85afc553-04bf-4f51-8ee4-677af87a7d58",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Task 2: Summarize the Paper\n",
    "\n",
    "The second task is to summarize the paper. Ideally, the summary should draw not only on the abstract, introduction, and conclusion, but also on the important points made throughout the paper.\n",
    "\n",
    "Aria is able to provide such a summary. See the results below and try it on more papers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "734b9d86-8120-445c-8583-7c0d511a1b94",
   "metadata": {
    "ExecutionIndicator": {
     "show": true
    },
    "execution": {
     "iopub.execute_input": "2024-10-04T12:55:43.541127Z",
     "iopub.status.busy": "2024-10-04T12:55:43.540945Z",
     "iopub.status.idle": "2024-10-04T12:56:22.733672Z",
     "shell.execute_reply": "2024-10-04T12:56:22.732718Z",
     "shell.execute_reply.started": "2024-10-04T12:55:43.541111Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/tmp/ipykernel_1195546/1069219567.py:13: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.\n",
      "  with torch.inference_mode(), torch.cuda.amp.autocast(dtype=torch.bfloat16):\n",
      "Processed prompts: 100%|██████████| 1/1 [00:38<00:00, 38.63s/it, est. speed input: 301.05 toks/s, output: 19.44 toks/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The paper titled \"LONGVIDEOBENCH: A Benchmark for Long-context Interleaved Video-Language Understanding\" by Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li introduces a comprehensive benchmark for evaluating Large Multimodal Models (LMMs) in understanding long-duration videos. The paper highlights the challenges in processing long-context inputs and presents LONGVIDEOBENCH, a novel benchmark designed to address these challenges.\n",
      "\n",
      "### Key Points:\n",
      "\n",
      "1. **Introduction to LONGVIDEOBENCH**:\n",
      "   - The benchmark is introduced to measure the performance of LMMs on long-duration videos, which are videos up to an hour long.\n",
      "   - It includes 3,763 videos with subtitles across diverse themes, designed to comprehensively evaluate LMMs on long-term multimodal understanding.\n",
      "\n",
      "2. **Referring Reasoning Task**:\n",
      "   - The benchmark focuses on the referring reasoning task, where models need to interpret and reason about specific video contexts.\n",
      "   - It includes 6,678 human-annotated multiple-choice questions divided into 17 fine-grained categories.\n",
      "\n",
      "3. **Challenges and Evaluation**:\n",
      "   - The paper identifies two primary challenges: retrieving details from long videos and reasoning contextual relations in long videos.\n",
      "   - The benchmark evaluates models on their ability to understand and respond to questions related to specific video contexts.\n",
      "\n",
      "4. **Dataset Construction**:\n",
      "   - The dataset includes videos of varying durations (8s, 15s, 60s, 180s, 600s, 900s, 3600s) and is collected from 119 channels.\n",
      "   - Videos are annotated with subtitles and questions to facilitate the referring reasoning task.\n",
      "\n",
      "5. **Annotation and Quality Control**:\n",
      "   - The annotation process involves multiple stages, including primary annotation, examination by an examiner, and revision by a reviser.\n",
      "   - This ensures high-quality annotations and consistent performance evaluation.\n",
      "\n",
      "6. **Evaluation Results**:\n",
      "   - The paper presents detailed evaluation results, showing that proprietary models generally outperform open-source models, especially for longer videos.\n",
      "   - It highlights the performance trends of different models across various video durations and question categories.\n",
      "\n",
      "7. **Conclusion**:\n",
      "   - LONGVIDEOBENCH provides valuable insights into the deficiencies of existing models and serves as a tool for guiding future research in multimodal video understanding.\n",
      "   - The benchmark is positioned as a significant asset for advancing the capabilities of LMMs in handling long-context video-language tasks.\n",
      "\n",
      "### Figures and Tables:\n",
      "- **Figure 1**: Illustrates the referring reasoning task with examples.\n",
      "- **Table 1**: Compares LONGVIDEOBENCH with other video QA benchmarks.\n",
      "- **Table 2**: Defines 17 categories of referring questions.\n",
      "- **Table 3 & 4**: Provide statistics of videos in LONGVIDEOBENCH by duration groups and video layouts.\n",
      "- **Figure 4**: Shows the accuracy of different models concerning query depth and video duration.\n",
      "\n",
      "Overall, the paper provides a thorough examination of the challenges and opportunities in evaluating LMMs on long-duration videos, offering a robust benchmark to facilitate this evaluation.<|im_end|>\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "\n",
    "messages = [\n",
    "    {\n",
    "        \"role\": \"user\",\n",
    "        \"content\": [\n",
    "            *contents,\n",
    "            {\"text\": \"Please provide an in-detail summary of the paper.\", \"type\": \"text\"},\n",
    "        ],\n",
    "    }\n",
    "]\n",
    "\n",
    "text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)\n",
    "\n",
    "with torch.inference_mode(), torch.cuda.amp.autocast(dtype=torch.bfloat16):\n",
    "    outputs = model.generate(\n",
    "            {\n",
    "                \"prompt_token_ids\": text,\n",
    "                \"multi_modal_data\": {\n",
    "                    \"image\": images,\n",
    "                    \"max_image_size\": 980,  # [Optional] The max image patch size, default `980`\n",
    "                    \"split_image\": True,  # [Optional] whether to split the images, default `False`\n",
    "                },\n",
    "            },\n",
    "            sampling_params=SamplingParams(max_tokens=4096, top_k=1, stop=[\"<|im_end|>\"])\n",
    "        )\n",
    "    generated_tokens = outputs[0].outputs[0].token_ids\n",
    "    result = tokenizer.decode(generated_tokens)\n",
    "\n",
    "\n",
    "print(result)"
   ]
  },
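  {
   "cell_type": "markdown",
   "id": "f5d4a6e8-7b9c-4dae-8fb0-3b4c5d6e7f81",
   "metadata": {},
   "source": [
    "The tasks in this section differ only in the question appended after the page placeholders. A small helper can build the message list; `build_messages` is a hypothetical name, and the stand-in placeholder list only illustrates the shape."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f5d4a6e8-7b9c-4dae-8fb0-3b4c5d6e7f82",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: build a single-turn user message from page placeholders and a question.\n",
    "def build_messages(page_contents, question):\n",
    "    \"\"\"Return a chat message list: page placeholders followed by the question.\"\"\"\n",
    "    return [\n",
    "        {\n",
    "            \"role\": \"user\",\n",
    "            \"content\": [*page_contents, {\"text\": question, \"type\": \"text\"}],\n",
    "        }\n",
    "    ]\n",
    "\n",
    "demo_pages = [{\"text\": \"Page 1: \", \"type\": \"text\"}, {\"text\": None, \"type\": \"image\"}]\n",
    "demo_messages = build_messages(demo_pages, \"Please summarize this page.\")\n",
    "print(len(demo_messages[0][\"content\"]))"
   ]
  },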
  {
   "cell_type": "markdown",
   "id": "6e28d721-94c8-40b3-be7e-1542bb07f962",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Task 3: Detailed Question-Answering\n",
    "\n",
    "As the third task, we ask Aria a detail-oriented question whose answer lies in the middle of the paper."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "dcab3d93-73bc-4181-a302-370b741dcfc4",
   "metadata": {
    "ExecutionIndicator": {
     "show": true
    },
    "execution": {
     "iopub.execute_input": "2024-10-04T12:56:22.734771Z",
     "iopub.status.busy": "2024-10-04T12:56:22.734556Z",
     "iopub.status.idle": "2024-10-04T12:56:33.860689Z",
     "shell.execute_reply": "2024-10-04T12:56:33.860018Z",
     "shell.execute_reply.started": "2024-10-04T12:56:22.734753Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/tmp/ipykernel_1195546/1714091579.py:13: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.\n",
      "  with torch.inference_mode(), torch.cuda.amp.autocast(dtype=torch.bfloat16):\n",
      "Processed prompts: 100%|██████████| 1/1 [00:10<00:00, 10.61s/it, est. speed input: 1097.33 toks/s, output: 17.25 toks/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The two major difficulties in understanding long videos, as outlined in the paper, are:\n",
      "\n",
      "1. **Retrieving details from long videos**: Existing Large Multimodal Models (LMMs) often struggle to extract specific details from long sequences. To accurately assess tasks in LONGVIDEOBENCH, there is a need for models to focus on granular details such as objects, events, or attributes, rather than providing a summary or topic overview.\n",
      "\n",
      "2. **Reasoning contextual relations in long videos**: Questions in LONGVIDEOBENCH require models to analyze the interconnections among diverse contents. This involves understanding the relationships among objects, events, or attributes within the video, which is significantly challenging for extensive inputs. The tasks demand models to derive the correct answer by examining the context and relations across multiple moments in the video.<|im_end|>\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "\n",
    "messages = [\n",
    "    {\n",
    "        \"role\": \"user\",\n",
    "        \"content\": [\n",
    "            *contents,\n",
    "            {\"text\": \"According to the paper, what are the two major difficulties in understanding long videos? Reply me in Latex format.\", \"type\": \"text\"},\n",
    "        ],\n",
    "    }\n",
    "]\n",
    "\n",
    "text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)\n",
    "\n",
    "with torch.inference_mode(), torch.cuda.amp.autocast(dtype=torch.bfloat16):\n",
    "    outputs = model.generate(\n",
    "            {\n",
    "                \"prompt_token_ids\": text,\n",
    "                \"multi_modal_data\": {\n",
    "                    \"image\": images,\n",
    "                    \"max_image_size\": 980,  # [Optional] The max image patch size, default `980`\n",
    "                    \"split_image\": True,  # [Optional] whether to split the images, default `False`\n",
    "                },\n",
    "            },\n",
    "            sampling_params=SamplingParams(max_tokens=4096, top_k=1, stop=[\"<|im_end|>\"])\n",
    "        )\n",
    "    generated_tokens = outputs[0].outputs[0].token_ids\n",
    "    result = tokenizer.decode(generated_tokens)\n",
    "\n",
    "    \n",
    "print(result)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.15"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
